Abstract - We analyse optimum reject strategies for
prototype-based classifiers and real-valued rejection measures,
using the distance of a data point to the closest prototype or
probabilistic counterparts. We compare reject schemes with global
thresholds, and local thresholds for the Voronoi cells of the
classifier. For the latter, we develop a polynomial-time algorithm
to compute optimum thresholds based on a dynamic programming
scheme, and we propose an intuitive linear time, memory efficient
approximation thereof with competitive accuracy. Evaluating the
performance in various benchmarks, we conclude that local reject
options are beneficial in particular for simple prototype-based
classifiers, while the improvement is less pronounced for advanced
models. For the latter, an accuracy-reject curve which is
comparable to support vector machine classifiers with state of the
art reject options can be reached.

Keywords: classification, prototype-based, distance-based, 
reject option, local strategies 

1 Introduction 

1.1 Motivation 

Classification constitutes one of the standard application
scenarios for machine learning techniques: Its application ranges
from automated digit recognition up to fraud detection, and
numerous machine learning models are readily available for this
task 0. Often, besides the overall classification accuracy, the
flexibility of the classification model to handle uncertain
predictions plays an important role. Techniques which provide a
level of certainty together with the predicted class label can
trade classification security for a partial prediction of the
labels; in the latter case, data for which the prediction is
insecure are rejected. In particular, applications which require
life long learning or an adaptation to changing conditions benefit
from such flexible classification models [32]. Moreover, in safety
critical areas such as driver assistance systems, health care, or
biomedical data analysis, the information about the certainty of
the classification is almost as important as the class label
itself. Further tests or expert opinions can be consulted for
uncertain classifications to avoid critical effects of a
misclassification, for instance in health care. For driver
assistance systems, a high degree of uncertainty can result in
turning off the assistance system and passing the responsibility
back to the human driver. In all settings, the possibility of a
machine learning classifier to reject a classification in case of
a low classification confidence is crucial.

Reject options have been pioneered by the formal framework
investigated in the approach [12]: If the costs for a
misclassification versus a reject are known, one can design an
optimum reject threshold based on the probability of
misclassification. In practice, however, the exact probability of
a misclassification is generally unknown. Hence further research
addresses the question whether reject options can be based on
plugin rules where only empirical estimates of the
misclassification probability are used 122. Still, these
formalisations rely on consistent probability estimates, which are
often not available for given classifiers. Further, rejection and
misclassification costs need to be known and constant, which is
not necessarily the case, in particular for online settings. Thus,
these settings deal with an idealised modelling and are not
necessarily applicable for efficient, possibly deterministic
classifiers in complex scenarios.

Some machine learning classifiers allow an intuitive incorporation
of reject options. Naturally, probabilistic classifiers can
directly be plugged into the framework as analysed in [27]. In
particular, probabilistic classifiers provide confidence values
for which the given scaling is meaningful, provided the
probability estimation is correct. The latter is often not the
case since the model assumptions need not hold, the model often
relies on simplified assumptions, or model priors are chosen based
on computational feasibility rather than the (unknown) underlying
truth. Further, the inference of exact probabilistic models is not
always feasible, depending on the type and size of data and the
available ground truth.

One principled alternative to enhance given classifiers by
confidence values is offered by bootstrapping 0. This, however,
requires repeated training with the available training data, such
that it displays a high computational and memory complexity.
Further, it is not applicable for online settings where data are
not necessarily independent and identically distributed. For
online learning, the theory of conformal prediction has caused
quite some interest recently [53, 45]. The formalism is based on a
so-called non-conformity measure; having chosen a suitable
criterion, it provides a statistically well founded theory to
estimate the confidence of a classification in online settings,
where data have to fulfil only the weaker property of
interchangeability rather than being i. i. d., see [53, 45]. In
practice, however, the choice of the non-conformity measure is
very critical and suboptimal choices do not lead to meaningful
results. Further, the original approach is very time consuming
since it requires the re-training of the model in a leave-one-out
fashion. Albeit efficient approximations exist for some
classifiers such as prototype-based models, the formal guarantees
usually no longer hold for the latter

ED- 

There have been attempts to accompany powerful deterministic
classifiers by efficient ways of confidence estimation. One
popular example is given for the support vector machine (SVM), see
the approach [36] for two-class classification and the work [54]
for extensions towards multiple classes. These techniques are
implemented e.g. in the popular LIBSVM [11]. In this article, we
are interested in an alternative classification paradigm:
Prototype-based models which represent classes in terms of typical
representatives and thus allow a direct inspection of the
classifier. This feature has contributed to an increasing
popularity of these models in particular in the biomedical domain,
see e. g. EHDEUE), by offering an elegant representation which
lends itself to model interpretability in a natural way
[52, 22, 38]. Further, the representation of models in terms of
few representative prototypes has proved useful when dealing with
online scenarios or big data sets mmm. While some approaches exist
to accompany nearest neighbour based classification or Gaussian
mixture models (GMM) by confidence estimations [50, 28], first
reject options for discriminative prototype-based methods such as
learning vector quantisation have only recently been proposed
[19, 20]. In this article, we will build on the insights gained in
the recent approaches [19, 20], and we will investigate how to
optimally set the thresholds within intuitive reject schemes for
prototype-based techniques.

While the threshold selection strategies which we will investigate
can be used for any prototype-based classification scheme, we will
focus on the popular supervised classification technique learning
vector quantisation (LVQ) and its recent more fundamental
mathematical derivatives [33, 43, 40, 42]. LVQ constitutes a
powerful and efficient method for multi-class classification tasks
which, due to its simple representation of models in terms of
prototypes, is particularly suited for interpretability, online
scenarios or life long learning [32]. While classical LVQ models
mostly rely on heuristics, modern variants are based on cost
functions such as generalized LVQ (GLVQ) [39], or the full
probabilistic model robust soft LVQ (RSLVQ) [43]. LVQ classifiers
can be accompanied by strong guarantees concerning their
generalization performance and learning dynamics MM- One
particular success story links LVQ classifiers to metric learners:
These enrich the classifier by feature weighting terms, which
opens the way towards a more flexible classification scheme,
increased model interpretability, and even a simultaneous
visualisation of the classifier [40, 42, 5]. Further, recent LVQ
variants address the setting of complex, possibly non-Euclidean
data which are described by pairwise similarities or
dissimilarities only [25]. Apart from the probabilistic model
RSLVQ, these classifiers are often deterministic and do not
provide a confidence of the classification. Further, also for
RSLVQ, the correctness of the probability estimate is not clear
since the model is not designed to correctly model the data
probability but the conditional label probability only mm.

In this contribution, building on the results recently published
in [19], which proposes different real-valued certainty measures
suitable for an integration in a reject option, we investigate how
to devise optimum reject strategies for LVQ type classifiers,
putting a particular emphasis on the choice of the threshold for a
reject. In particular, we are interested in efficient,
online-computable reject options for LVQ classifiers and their
behaviour in comparison to mathematically well founded statistical
models and the SVM [10, 14]. We will compare reject strategies
based on one global reject threshold, and local reject thresholds
which take into account the Voronoi tessellation of the space
induced by the prototypes. For the latter, we present an optimum
computation scheme how to set the threshold for the different
Voronoi cells, and we also propose a time and memory efficient
approximation thereof which lends itself to online scenarios. We
evaluate the techniques extensively using different benchmark data
and different types of LVQ classifiers, also providing a
comparison to classification with rejection based on SVM. Further,
we demonstrate the suitability of the devised technique for a real
life example from a medical domain.

This contribution is structured as follows: In section 1.2 we give
an overview of existing methods to enhance classifiers by reject
options. Afterwards, in section 2, we explain the LVQ training
algorithms that we use in our experiments, and we introduce in
section 3 the basic schemes how to reject based on global or local
thresholds. Thereby, we develop a polynomial time scheme based on
dynamic programming (DP) that allows an optimum choice of local
thresholds, as well as a time and memory efficient greedy
approximation thereof. Further, we present in section 3.1 suitable
certainty measures that can be plugged into these reject schemes.
In the experiments section 5 we test the techniques using
different benchmarks and LVQ learning schemes. We illustrate the
suitability of the methods, whereby we put a particular emphasis
on the comparison of local versus global reject schemes, the
comparison of an optimum computation of local thresholds by means
of DP versus an efficient greedy scheme, and a comparison of the
proposed reject schemes for highly flexible LVQ classifiers with
state of the art reject options which accompany an SVM. For all
experiments, evaluation will rely on the full accuracy-reject
curve as proposed in the approach [34].


1.2 Related Work 

The following section summarises the state of the art for 
reject options and accompanying certainty measures in 
supervised learning. The approach m highlights two 
main reasons for rejection: 

• Ambiguity: It is not clear how to classify the data 
point, e. g. the point is close to at least one decision 
boundary, or it lies in a region where at least two 
classes are overlapping. 


• Outliers: The data point is dissimilar to any already 
seen data point, e. g. it is caused by noise or it is an 
instance of a yet unseen class or cluster. 

There exist several approaches which explicitly address one of
these reasons or a combination of both. Mostly, reject options are
based on a measure which provides a certainty value about whether
a given data point is correctly classified. In the following, we
distinguish measures which are based on heuristics and approaches
which are based on estimates of misclassification probabilities or
confidences. We primarily focus on techniques which have been
proposed for distance-based classifiers and similar models, due to
their similarity to LVQ techniques.


Heuristic Measures: For k-nearest neighbour (k-NN) IH approaches a
variety of simple certainty measures exist using a neighbourhood
of a given data point [16, 28]. These measures rely on the
correlation of the label of the data point and its neighbours
(cp. Fig. 1). In these approaches, several different realisations
and combinations of the counting have been compared, leading to
the result that an ensemble measure considerably increases the
stability of the single measures. The approach [48] focusses on
effective outlier detection, relying on the distances of a new
data point from elements of a randomly chosen subset of the given
data. An outlier score is then given by the smallest distance. The
resulting method outperforms state of the art approaches such as
proposed in [37] in efficiency and accuracy.

Sousa & Cardoso m introduce a reject option which identifies
ambiguous regions in binary classifications. Their approach is
based on a data replication method. An advantage of the proposed
strategy is given by the fact that no reject threshold has to be
set externally, rather the technique itself provides a suitable
cutoff.

The approach [47] addresses different neural network architectures
including multi-layer perceptrons, learning vector quantisation,
and probabilistic neural networks. Here an effectiveness function
is introduced taking different costs for rejection and
classification errors into account, very similar to the loss
function as considered in mm. Then, different rejection measures
based on the activation of the output neurons are investigated.



Fig. 1: Sketch of a possible k-NN reject scheme (k = 3). Different
symbols indicate different classes. Classification of the left
data point (x) is more uncertain than the right one (x) because
all neighbours are of the same class for the latter.

A very popular approach to turn the activity provided by a binary
SVM into an approximation of a classification confidence measure
has been proposed by Platt [36]. The certainty measure is based on
the distance of a data point to the decision border, i. e. the
activation of an SVM classifier. By means of a non-linear
sigmoidal function, the distance is transformed to a confidence
value. Thereby, the parameters of the sigmoidal are fitted on the
given training data (Fig. 2). A transfer of this method to
multi-class tasks is provided by Wu et al. [54] and it is
implemented in the popular LIBSVM toolbox [11].

Probabilistic Measures: There exist several approaches which more
closely rely on an explicit probabilistic modelling of the data.
As already mentioned, the approach [12] investigates optimum
reject options provided the true probability density function is
known. This rule can therefore serve as a baseline provided this
ground truth is available. In the limit case, this reject strategy
provides a bound for any other measure in the sense of the
error-reject trade-off, as proved in [26]. Hansen et al. also
extend Chow's rule [12] to near optimal classifiers on finite data
sets, and they introduce a general scaling to compare error-reject
curves of several independent experiments even with different
classifiers or data sets. The work presented in [23] also directly
builds on [12] and more closely investigates the decomposition of
data into different regions as concerns the given classes and
potential errors. They propose a strategy which is based on class
related thresholds for more flexibility and better results in
practice. The setting that reliable class probabilities are
unavailable and only empirical estimations thereof are available
is addressed in the approach [27].

Due to this theoretical background, many approaches follow the
roadmap to empirically estimate the data distribution first.
Often, GMMs are used for this purpose OH ESI. Devarakota et al. mo
extend a GMM to estimate the insecurity of a particular class
membership for novel, previously unseen patterns of a new class;
this estimation can yield a reliable outlier reject option.
Vailaya & Jain [50] investigate the suitability of GMMs for both
rejection of outliers and ambiguous data. In particular, they
propose an efficient strategy how to determine suitable reject
thresholds in these cases. The reliable estimation of GMMs is
particularly problematic for high dimensional data. Therefore,
Ishidera et al. Go) propose a suitable approximation of the
probability density function for high dimensionality, which is
based on a low dimensional projection of the data.

These approaches, while providing baselines against which to
compare, do not address our setting of prototype-based multi-class
classifiers. We will rely on two ingredients for an efficient
reject option: (I) A suitable real-valued certainty measure
[19, 20] and (II) a suitable definition of how to set a threshold
for rejection. In most classical reject schemes as summarised
above, one global threshold value is taken, and an optimum value
depends on the respective costs of misclassification versus
reject. This, however, relies on the assumption of a suitable
global scaling of the underlying certainty measure, an assumption
which is usually not met in a given setting. Therefore, we will
focus on possibilities how to define optimum local thresholds,
which remove the burden of a globally appropriate scaling of the
underlying certainty measure. In particular, we will propose
efficient schemes how to optimise local thresholds which are
attached to the Voronoi cells given by the prototype-based model.

First we introduce prototype-based classifiers and the 
most relevant training schemes used in the following. 

2 Prototype-based Classifiers 

A prototype-based classifier is characterised by a set $W$ of
$\xi$ prototypes $(w_j, c(w_j)) \in \mathbb{R}^m \times \{1, \ldots, Z\}$,
whereby every prototype $w$ is equipped with a class label $c(w)$.
Classification takes place by a winner takes all rule (WTA): Given
a data point $x$, its label becomes the label of the closest
prototype

$c(x) = c(w_l)$ with $l = \arg\min_{w_j \in W} d(w_j, x)$ (1)

where $d$ is a distance measure; a common choice for $d$ is the
Euclidean distance. The closest prototype $w_l$, the winner, is
called the best matching unit. Note that prototype-based models
are very similar to k-NN classifiers E), which store all training
data points as prototypes and predict a label according to the
closest ($k = 1$) or the $k$ closest data points. In contrast,
prototype-based



Fig. 2: Sketch of a binary classification setting in SVM. A
sigmoid is fitted against the values of the bins of the distances
from the data points to the separating hyperplane.

training models aim at a sparser representation of data by a
predefined number of prototypes. By means of the WTA rule, a
prototype-based classifier decomposes the data into Voronoi cells
or receptive fields

$V_j = \{x \mid d(w_j, x) < d(w_k, x), \forall k \neq j\}, \quad j = 1, \ldots, \xi$ (2)

and it defines a constant classification on any Voronoi cell given
by the label of its representative prototype.
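
The WTA rule (1) and the induced Voronoi cells (2) can be sketched
in a few lines. The following is only an illustration under stated
assumptions: the function name `wta_classify`, the toy prototypes
and the use of the squared Euclidean distance are chosen here for
the example and are not taken from the original work.

```python
import numpy as np

def wta_classify(X, prototypes, proto_labels):
    """Label each row of X by its closest prototype (winner takes all)."""
    # Squared Euclidean distances, shape (n_points, n_prototypes).
    diff = X[:, None, :] - prototypes[None, :, :]
    dist = np.einsum("npd,npd->np", diff, diff)
    winners = dist.argmin(axis=1)           # index of the best matching unit
    return proto_labels[winners], winners   # predicted label, Voronoi cell index

# Hypothetical toy model: three prototypes, one per class.
prototypes = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
proto_labels = np.array([0, 1, 2])
X = np.array([[0.5, 0.2], [3.5, 0.5], [0.1, 3.0]])
labels, cells = wta_classify(X, prototypes, proto_labels)
```

The second return value is the index of the Voronoi cell a point
falls into, which is the quantity a cell-wise reject threshold
would be attached to.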

Prototype locations are usually learned based on given data.
Assume a training data set $X$ is given with $N$ data points
$(x_i, y_i) \in \mathbb{R}^m \times \{1, \ldots, Z\}$. $Z$ states
the number of different classes. The goal is to find prototype
locations such that the induced classification of the data is as
accurate as possible. Classical training techniques are often
based on heuristics such as the Hebbian learning paradigm [33],
yielding surprisingly good results in typical model situations,
see IS). More recent training schemes usually rely on a suitable
cost function, including generalised LVQ (GLVQ) [39], its
extension to an adaptive matrix: generalized matrix LVQ (GMLVQ)
[40], its local version (LGMLVQ) [40] with local adaptive metrics,
and statistical counterparts referred to as robust soft LVQ
(RSLVQ) [43]. We will focus on GLVQ and its matrix version as a
particularly efficient and powerful scheme, as well as RSLVQ as a
full probabilistic model for which an explicit certainty value is
directly available.

GMLVQ: The Generalized Matrix Learning Vector Quantization [40]
performs a stochastic gradient descent on the cost function in
[39] with a more general metric $d_\Lambda$ than the standard
Euclidean one. This cost function is a differentiable function
which strongly correlates to the (discrete) classification error:

$E = \sum_{i=1}^{N} \Phi\left(\frac{d_\Lambda^+ - d_\Lambda^-}{d_\Lambda^+ + d_\Lambda^-}\right)$ (3)

Here, the metric $d_\Lambda$ is defined as general quadratic form


of the corresponding point is correct, hence the costs correlate
to the overall error and optimise the so-called hypothesis margin
of the classifier BOl . Note that the value
$(d_\Lambda^+ - d_\Lambda^-)/(d_\Lambda^+ + d_\Lambda^-)$ lies in
the interval $(-1, 0]$ for points $x_i$ which are in the Voronoi
cell of the prototype $w_j$ corresponding to $d_\Lambda^+$. A
value close to $-1$ indicates that the data point $x_i$ is very
close to the prototype and the classification is very certain,
while a value close to 0 refers to points at the class boundary or
outliers.

GMLVQ training is derived from these costs (3) by a stochastic
gradient descent with respect to the prototype locations and the
metric parameters $\Lambda$. Thereby, either a global matrix
$\Lambda$ is used, or local matrices $\Lambda_j$ are adapted which
induce the distance value for the Voronoi cell of prototype $w_j$
only:

$d_{\Lambda_j}(w_j, x) = (x - w_j)^T \Lambda_j (x - w_j)$ . (5)

The algorithm which refers to these local metrics (5) is called
local GMLVQ (LGMLVQ) [40].
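
As a rough illustration of a single summand of the GMLVQ cost
function, the sketch below evaluates the quadratic-form distance
and the relative distance difference for one labelled point. It
assumes the parameterisation $\Lambda = \Omega^T \Omega$, which
guarantees positive semi-definiteness but is an assumption of this
example, and it takes $\Phi$ to be the identity. The names
`quad_dist` and `glvq_mu` are hypothetical.

```python
import numpy as np

def quad_dist(x, w, omega):
    """(x - w)^T Lambda (x - w) with Lambda = omega^T omega (assumed form)."""
    proj = omega @ (x - w)
    return float(proj @ proj)

def glvq_mu(x, y, prototypes, proto_labels, omega):
    """Relative distance difference (d+ - d-) / (d+ + d-) for one point."""
    d = np.array([quad_dist(x, w, omega) for w in prototypes])
    same = proto_labels == y
    d_plus = d[same].min()     # closest prototype with the correct label
    d_minus = d[~same].min()   # closest prototype with a different label
    return (d_plus - d_minus) / (d_plus + d_minus)

# Hypothetical toy model without metric adaptation (omega = identity).
prototypes = np.array([[0.0, 0.0], [4.0, 0.0]])
proto_labels = np.array([0, 1])
omega = np.eye(2)
mu_correct = glvq_mu(np.array([1.0, 0.0]), 0, prototypes, proto_labels, omega)
mu_wrong = glvq_mu(np.array([1.0, 0.0]), 1, prototypes, proto_labels, omega)
```

A negative value corresponds to a correctly classified point and a
positive value to an error, which is why summing these terms
correlates with the classification error.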

RSLVQ: The objective function of Robust Soft Learning Vector
Quantization [43] corresponds to a statistical modelling of the
setting. It relies on the assumption that data points are
generated by a GMM. The probability of mixture component $j$
generating data point $x$ is

$p(x|j) = \frac{1}{(2\pi\sigma_j^2)^{m/2}} \exp\left(-\frac{d(w_j, x)}{2\sigma_j^2}\right)$

This induces the mixture model

$p(x|W) = \sum_{1 \le j \le \xi} P(j) \cdot p(x|j)$

which describes the probability of having observed the
(unlabelled) data. The priors sum to one, $\sum_j P(j) = 1$. Label
information is incorporated into the model by enhancing every
mixture component (i. e. every prototype) with a class label. Then
the probability of having observed the labelled data is given by


$d_\Lambda(w, x) = (x - w)^T \Lambda (x - w)$ (4)

with a positive semi-definite matrix $\Lambda$. The value
$d_\Lambda^+ = d_\Lambda(w_j, x_i)$ is the distance of a data
point $x_i$ to the closest prototype $w_j$ belonging to the same
class and $d_\Lambda^- = d_\Lambda(w_k, x_i)$ is the distance of a
data point $x_i$ to the closest prototype $w_k$ belonging to a
different class. $\Phi$ is a monotonically increasing function,
e. g. the identity or the logistic function. The summands in this
cost function are negative if and only if the classification


$p(x, y|W) = \sum_{j: c(w_j) = y} P(j) \cdot p(x|j)$ .

The objective function of RSLVQ is defined as the log likelihood
ratio of the observed data

$\log L := \sum_{1 \le i \le N} \log \frac{p(x_i, y_i|W)}{p(x_i|W)}$

which corresponds to the optimisation of the likelihood of the
observed class labels assuming an underlying mixture model and
independence of the data. Training optimises these costs by means
of a gradient ascent with respect to the prototype locations. The
bandwidth $\sigma_j$ is typically set identically for all mixture
components, and it is treated as a meta-parameter. There exist
schemes which also adapt the bandwidth [41, 44].
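
To make the mixture model concrete, the following sketch computes
the class posterior $p(x, y|W)/p(x|W)$ for a single point, i. e.
the term whose logarithm is summed in the RSLVQ objective. It
assumes the squared Euclidean distance, a shared bandwidth and
uniform priors unless others are passed; the function name
`rslvq_class_posterior` and the toy data are hypothetical. No
safeguard against numerical underflow for points far from all
prototypes is included.

```python
import numpy as np

def rslvq_class_posterior(x, prototypes, proto_labels, sigma2=1.0, priors=None):
    """Return (classes, p(y|x)) under a Gaussian mixture with labelled components."""
    n_proto, dim = prototypes.shape
    if priors is None:
        priors = np.full(n_proto, 1.0 / n_proto)   # assumed uniform priors P(j)
    d = ((prototypes - x) ** 2).sum(axis=1)        # squared Euclidean distance
    norm = (2.0 * np.pi * sigma2) ** (dim / 2.0)
    comp = priors * np.exp(-d / (2.0 * sigma2)) / norm   # P(j) * p(x|j)
    classes = np.unique(proto_labels)
    joint = np.array([comp[proto_labels == c].sum() for c in classes])  # p(x, y|W)
    return classes, joint / joint.sum()            # divide by p(x|W)

prototypes = np.array([[0.0, 0.0], [2.0, 0.0]])
proto_labels = np.array([0, 1])
classes, post_mid = rslvq_class_posterior(np.array([1.0, 0.0]), prototypes, proto_labels)
_, post_near = rslvq_class_posterior(np.array([0.0, 0.0]), prototypes, proto_labels)
```

Taking the maximum of the returned posterior yields an empirical
confidence value for the predicted class.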

3 Rejection Strategies 

We are interested in rejection strategies for prototype-based
classifiers or similar models that rely on two main ingredients:

1. A certainty measure which assigns a degree of certainty $r(x)$
to every data point $x$ indicating the certainty of the predicted
class label,

2. and a strategy how to reject a classification based on the
certainty value; suitable reject strategies have to take into
account that $r(x)$ is not necessarily scaled in an easily
interpretable or uniform way. This means that the exact value
$r(x)$ does not necessarily coincide with the statistical
confidence (which would be uniformly scaled in $[0, 1]$), and the
scaling of the value $r(x)$ might even change depending on the
location of the data point $x$.

First, we briefly review suitable certainty measures $r(x)$ before
discussing optimum reject strategies based thereon.


3.1 Certainty Measures 

In the recent approaches [19, 20] several certainty measures have
been proposed and evaluated for prototype-based classification. We
will use three measures which scored best in the experiments as
presented in [19, 20]:

Bayesian Confidence Value: Chow analysed the error-reject
trade-off of Bayes classification. He introduced an optimal
certainty measure in the sense of error-reject trade-off [12]. The
certainty value for a data point $x$ in case of a Bayes classifier
is defined as:

$r(x) = \mathrm{Bayes}(x) := \max_j P(j|x)$ (6)

where $P(j|x)$ is the known probability of class $j$ for a given
data point $x$ (Fig. 3 left). This value can be interpreted as
follows: If the highest probability for any class with given $x$
is lower than a defined threshold $\theta$, the probability of
making a mistake is relatively high. Classification of such data
is insecure according to the chosen $\theta$. For a binary problem
the Bayes reject rule (6) defines an interval around the decision
border.


Empirical Estimation of the Bayesian Probability: Probabilistic
models like the RSLVQ model provide explicit estimations of the
probability of class $j$ given a data point $x$. We refer to these
empirical estimates as $\hat{P}(j|x)$ and they induce a certainty
measure of the form

$r(x) = \mathrm{Conf}(x) := \max_j \hat{P}(j|x)$ . (7)

Fig. 3 (right) shows an exemplary result of this measure.

Fig. 3: ED Artificial five class data set with the contour lines
of Bayes (6) (left side) and the contour lines of Conf (7) with
respect to a RSLVQ model (right side, black squares are
prototypes).


Relative Similarity: The relative similarity (RelSim) has been
proposed as a certainty measure closely related to the GMLVQ cost
function (3), see [39, 19]. It relies on the normalised distance
of a data point $x$ to the closest prototype $d^+$ and the
distance of $x$ to a closest prototype of a different class $d^-$
(Fig. 4):

$r(x) = \mathrm{RelSim}(x) := \frac{d^- - d^+}{d^- + d^+}$ (8)

whereby $d$ is the distance measure of the used algorithm
($d_\Lambda$ (4) or $d_{\Lambda_j}$ (5)). Note that the prototype
which



Fig. 4: Sketch of an artificial three-class setting (different
symbols, bigger ones are prototypes). For a single data point
$d^+$, $d^-$ are shown.

belongs to $d^+$ also defines the class label of $x$. The
certainty measure RelSim ranges in the interval $[0, 1)$ where
values near 1 indicate a certain classification and values near 0
are an indicator for very uncertain class labels.

The values $d^+$ and $d^-$ are calculated within GLVQ training
schemes, hence no additional computational costs are caused by
this certainty measure for training set data. Furthermore, RelSim
(8) depends on the stored prototypes $W$ only. Therefore no
additional storage is needed when computing the certainty of a new
unlabelled data point $x$. Figure 5 shows the contour lines of
RelSim (8) for an artificial five class problem with prototypes
trained by GMLVQ without metric adaptation, i. e.
$\Lambda_{ii} = 1$ and $\Lambda_{ij} = 0$, $i \neq j$. The
certainty values near the class borders are low, hence the measure
correctly identifies ambiguous classifications. In addition, the
contour lines have a circular shape, such that the certainty
measure also correctly identifies outliers which have a large
distance from the learned prototypes.
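
A minimal sketch of RelSim (8) for a new, unlabelled point is given
below. It assumes the squared Euclidean distance (no metric
adaptation); the function name `relsim` and the toy prototypes are
illustrative only.

```python
import numpy as np

def relsim(x, prototypes, proto_labels):
    """RelSim(x) = (d- - d+) / (d- + d+) based on the stored prototypes only."""
    d = ((prototypes - x) ** 2).sum(axis=1)   # squared Euclidean distances
    winner = d.argmin()
    d_plus = d[winner]                        # distance to the closest prototype
    # Closest prototype whose label differs from the winner's label.
    d_minus = d[proto_labels != proto_labels[winner]].min()
    return (d_minus - d_plus) / (d_minus + d_plus)

prototypes = np.array([[0.0, 0.0], [4.0, 0.0]])
proto_labels = np.array([0, 1])
r_certain = relsim(np.array([1.0, 0.0]), prototypes, proto_labels)
r_border = relsim(np.array([2.0, 0.0]), prototypes, proto_labels)
```

The point on the class border receives the value 0, while the
point close to one prototype receives a value near 1, in line with
the interpretation given above.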


labelling errors are rejected. In general this is not the 
case and a reject measure leads to the rejection of a few 
correctly classified data points together with errors. For 
optimum rejects, the number of false rejects should be 
as small as possible, while rejecting as many as possible 
true rejects. 




Fig. 6: Sketch of an artificial three-class setting (different
symbols, bigger ones are prototypes) with a reject option. Left:
Original model without rejection, three marked points are errors
in classification. Right: Model with optimal rejection since only
the three errors are rejected.




Fig. 5: Qll Artificial five class data set with prototypes trained
by GMLVQ (black squares) without metric adaptation. The coloured
curves are the contour lines of RelSim (8). Note that a critical
region for a global threshold is between the second and the third
cluster from left. The third cluster needs a high threshold
because the data points are very compact. Applying the same
threshold for the second cluster would reject most data points in
this cluster which is not optimal.

3.2 Global Reject Option 

A global reject option extends a certainty measure by a global
threshold for the whole input space. Assume that

$r: \mathbb{R}^m \to \mathbb{R}, \; x \mapsto r(x)$ (9)

refers to a certainty measure where a higher value indicates
higher certainty. Given a real-valued threshold
$\theta \in \mathbb{R}$, a data point $x$ is rejected if and only
if

$r(x) < \theta$ . (10)


3.3 Local Reject Option 

Global reject options rely on the assumption that the scaling of
the certainty measure $r(x)$ is the same for all inputs $x$. This
assumption can be weakened by introducing local threshold
strategies. A local threshold strategy relies on a partitioning of
the input space into several regions and a different choice of the
reject threshold for every region; this way, it enables a finer
control of rejection [50, 18]. Following the suggestion in [50],
we use the natural decomposition of the input space into the
Voronoi cells $V_j$ as introduced in Eq. (2). A separate threshold
$\theta_j \in \mathbb{R}$ is chosen for every Voronoi cell, and
the reject option is given by a threshold vector
$\boldsymbol{\theta} = (\theta_1, \ldots, \theta_\xi)$ of
dimension $\xi$, equal to the number of Voronoi cells $V_j$. A
data point $x$ is rejected iff

$r(x) < \theta_j$ where $x \in V_j$ .

This means the threshold $\theta_j$ determines the behaviour for
the region $V_j$ only. In the case of one prototype per class,
local thresholds realise a class-wise reject option. For the
example in Fig. 6 a local rejection would lead to a
three-dimensional threshold vector
$\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_3)$.
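
The difference between the global rule (10) and the cell-wise rule
can be illustrated with a short sketch. The certainty values, cell
indices and thresholds below are made-up toy numbers, and the
function names `reject_global` and `reject_local` are hypothetical.

```python
import numpy as np

def reject_global(r, theta):
    """Reject every point whose certainty lies below one global threshold."""
    return r < theta

def reject_local(r, cells, thetas):
    """Reject a point if its certainty lies below the threshold of its Voronoi cell."""
    return r < thetas[cells]

r = np.array([0.9, 0.2, 0.5, 0.4])     # certainty values r(x)
cells = np.array([0, 0, 1, 1])         # index j of the Voronoi cell V_j of each point
global_mask = reject_global(r, 0.45)
local_mask = reject_local(r, cells, np.array([0.3, 0.6]))
```

Here the local rule rejects the third point as well, because its
cell uses a stricter threshold than the global one.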

4 Optimum Choices of Reject 
Thresholds 


An example of this rejection strategy is shown in Fig. 6. The
reject option operates optimally if only

We consider ways how to set a threshold (threshold vector)
optimally for a given classifier. Note that rejection refers to a
multi-objective: A threshold $\theta$ (a threshold vector
$\boldsymbol{\theta}$) should be chosen in such a way that the
rejection of errors is maximised, while the rejection of correctly
classified data points is minimised. To formalise this fact, and a
corresponding evaluation criterion, we first explain some terms
which we will use later on.

Assume a given data set $X$ ($|X| = N$) with labelled data for
evaluation. Applying a classification algorithm, this set
decomposes into a set of correctly classified data points $L$ and
a set of wrongly classified data points (errors) $E$, i. e.
$X = L \cup E$. An optimum reject would reject all points $E$,
while classifying all points $L$. Naturally, this is usually not
possible using a local or global reject option. Using a global
(local) reject option by applying a threshold $\theta$ (threshold
vector $\boldsymbol{\theta}$), the data set $X$ decomposes into a
set of rejected data points $\mathcal{X}_\theta$ and a set of data
points remaining in the system $X_\theta$, i. e.
$X = \mathcal{X}_\theta \cup X_\theta$. We refer to data points

$\mathcal{L}_\theta = \mathcal{X}_\theta \cap L$

as false rejects because the rejection of correctly classified
data points is undesired. The rejection of errors

$\mathcal{E}_\theta = \mathcal{X}_\theta \cap E$

is desired, therefore we call them true rejects. Obviously, we can
decompose $\mathcal{X}_\theta = \mathcal{E}_\theta \cup \mathcal{L}_\theta$.

For an evaluation, we want to report the accuracy of
the obtained classifier, taking the rejected points into
account. This multi-objective can be evaluated by
reference to the so-called accuracy reject curve (ARC)
[34]. For a given threshold $\theta$ (threshold vector $\boldsymbol{\theta}$), this
reports the accuracy of the classified points

$$t_a(\theta) := (|L| - |\mathcal{L}_\theta|) / |X_\theta| \qquad (11)$$

versus the ratio of the classified points

$$t_c(\theta) := |X_\theta| / |X| . \qquad (12)$$

These two measures quantify contradictory objectives
with limits $t_a(\theta) = 1$ and $t_c(\theta) = 0$ for large $\theta$ (all points
are rejected) and $t_a(\theta) = |L|/|X|$ and $t_c(\theta) = 1$ for small
$\theta$ (all points are classified, the accuracy equals the accuracy
of the given classifier for the full data set). We
are interested in thresholds such that the value $t_a$ is
maximised, and $t_c$ is minimised. Hence, not all possible
thresholds and corresponding pairs $(t_a(\theta), t_c(\theta))$
are of interest, but optimum choices only, which correspond
to the so-called Pareto front. Note that pairs
$(|\mathcal{L}_\theta|, |\mathcal{E}_\theta|)$ uniquely correspond to pairs $(t_a(\theta), t_c(\theta))$
and vice versa.
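The correspondence between the reject counts and a point of the ARC follows directly from (11) and (12) and can be sketched as below. The function name `arc_point` and the numbers in the example are hypothetical; the sketch assumes that the counts $|L|$, $|E|$, the number of false rejects and the number of true rejects are already known.

```python
from typing import Tuple


def arc_point(n_correct: int, n_errors: int,
              false_rejects: int, true_rejects: int) -> Tuple[float, float]:
    """Map reject counts to one ARC point (t_a, t_c) following (11) and (12)."""
    n_total = n_correct + n_errors                         # |X| = |L| + |E|
    n_classified = n_total - false_rejects - true_rejects  # |X_theta|
    if n_classified == 0:
        # t_a is undefined if every point is rejected
        raise ValueError("all points rejected, accuracy undefined")
    t_a = (n_correct - false_rejects) / n_classified
    t_c = n_classified / n_total
    return t_a, t_c


# Hypothetical example: 90 correct and 10 wrong points, nothing rejected.
full = arc_point(90, 10, 0, 0)
# Rejecting 5 correct points and 8 errors leaves 87 classified points.
partial = arc_point(90, 10, 5, 8)
```

Without rejection the accuracy equals that of the classifier on the full data set; rejecting mostly errors raises $t_a$ at the price of a smaller $t_c$.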

Every threshold uniquely induces a pair $(|\mathcal{L}_\theta|, |\mathcal{E}_\theta|)$
and a pair $(t_a(\theta), t_c(\theta))$. We say that $\theta'$ dominates the
choice $\theta$ if $|\mathcal{L}_{\theta'}| \le |\mathcal{L}_\theta|$ and $|\mathcal{E}_{\theta'}| \ge |\mathcal{E}_\theta|$ and, for at least
one term, strict inequality holds. We aim at the Pareto front

$$\mathcal{P}_\theta := \{ (|\mathcal{L}_\theta|, |\mathcal{E}_\theta|) \mid \theta \text{ is not dominated by any } \theta' \} . \qquad (13)$$

Every dominated threshold (threshold vector) corresponds
to a suboptimal choice only: We can increase
the number of true rejects without increasing the number
of false rejects, or, conversely, false rejects can be
lowered without lowering true rejects.

To evaluate the efficiency of a threshold strategy, it
turns out that a slightly different set is more easily accessible.
We say that $\theta'$ dominates $\theta$ with respect to
the true rejects if $|\mathcal{L}_{\theta'}| = |\mathcal{L}_\theta|$ and $|\mathcal{E}_{\theta'}| > |\mathcal{E}_\theta|$. This
induces the pseudo Pareto front

$$\hat{\mathcal{P}}_\theta := \{ (|\mathcal{L}_\theta|, |\mathcal{E}_\theta|) \mid \theta \text{ is not dominated by any } \theta' \text{ with respect to the true rejects} \} . \qquad (14)$$

Obviously, $\mathcal{P}_\theta$ can easily be computed as the subset of
$\hat{\mathcal{P}}_\theta$ by taking the minima over the false rejects. $\hat{\mathcal{P}}_\theta$ has
the benefit that it can be understood as a graph where
$|\mathcal{L}_\theta|$ varies between 0 and $|L|$ and $|\mathcal{E}_\theta|$ serves as
function value. Having computed $\hat{\mathcal{P}}_\theta$ and the corresponding
thresholds, we report the efficiency of a rejection
strategy by the corresponding ARC, i.e. the
pairs $(t_a(\theta), t_c(\theta))$: These pairs correspond to a graph
where we report the ratio of classified points (starting
from a ratio 1 down to 0) versus the obtained accuracy
for the classified points. For good strategies, this graph
should be increasing as fast as possible. In the following,
we discuss efficient strategies to compute the pseudo
Pareto front for global and local reject strategies.

4.1 Optimum Global Rejection 

For a global reject option, only one parameter $\theta$ is chosen.
$|\mathcal{L}_\theta|$ and $|\mathcal{E}_\theta|$ are monotonically increasing with
increasing $\theta$, and $|X_\theta|$ is decreasing. We can compute
thresholds which lead to the pseudo Pareto front and the
corresponding pairs $(t_a(\theta), t_c(\theta))$ in time $O(N \log N)$
due to the following observation: Consider the rejection
measure $r(x_i)$ as induced by the certainty function (9) for
all points $x_i \in X$ and sort the values $r(x_{i_1}) \le \ldots \le r(x_{i_N})$
(see Fig. 7). Additionally, Fig. 7 indicates via the symbol
$q_j \in \{+, -\}$ whether the corresponding point is in $L$ or
in $E$. For simplicity, we assume that the certainty values are not exactly
identical; otherwise, we sort the points
such that the points in $L$ come first. The following holds:

• Every pair $(|\mathcal{L}_\theta|, |\mathcal{E}_\theta|) \in \hat{\mathcal{P}}_\theta$ is generated by some
$\theta = r(x_{i_k})$ which corresponds to a certainty value
in this list or which corresponds to $\infty$ (i.e. all points
are rejected), since values in between do not alter
the number of rejected points on $X$.

• Values $r(x_{i_k})$ with $x_{i_k} \in E$ are dominated by $r(x_{i_{k+1}})$
(or $\infty$ for the largest value) with respect to true
rejects since the latter threshold accounts for the
same number of false rejects, adding one true reject
$x_{i_k}$.

• In contrast, values $r(x_{i_k})$ with $x_{i_k} \in L$ are not dominated
with respect to the number of true rejects.
Increasing this threshold always increases the number
of false rejects by adding $x_{i_k}$ to the rejected
points.

Hence, the pseudo Pareto front is induced by a set of
thresholds $\Theta$ corresponding to correctly classified points:

$$\Theta := \{ \theta = r(x_{i_k}) \mid x_{i_k} \in L \} \cup \{ \infty \mid \text{if } x_{i_N} \notin L \} . \qquad (15)$$

Obviously, $|\Theta| \in \{|L|, |L| + 1\}$ depending on whether
the last point in this list is classified correctly or not.
An exemplary setting is depicted in Fig. 7. We refer to
the thresholds obtained this way as $\theta(0), \ldots, \theta(|\Theta| - 1)$,
whereby we assume that these values are sorted in
ascending order.

Fig. 7: Reject thresholds for an area with 13 data points. The first row
reports the sorted certainty values $r(x_i)$, the second row encodes whether a
data point is correctly (+) or wrongly (−) classified. In this case, there are
4 thresholds which correspond to the Pareto front, according to the
number of signs + (taking into account the fact that point 13 is in $E$).
The third row shows the gain $g$ when increasing the threshold value $\theta$.

In addition, we can compute the gain $|g(k)|$ which is
obtained when increasing the threshold from $\theta(k-1)$ to
$\theta(k)$: For $k = 0, \ldots, |\Theta| - 1$, the quantity

$$g(k) := \{ x_i \mid \theta(k-1) \le r(x_i) < \theta(k),\ x_i \in E \} \qquad (16)$$

denotes the set of additional true rejects when increasing
$\theta$ from $\theta(k-1)$ to the value $\theta(k)$, where we define
$\theta(-1) := 0$. Note that $|g(0)|$ equals the maximum number
of true rejects without any false reject. It can easily
be computed by one scan through the sorted list of certainty
values, see Fig. 7. Obviously, the set

$$\mathcal{E}_{\theta(k)} = \bigcup_{0 \le i \le k} g(i), \qquad k = 0, \ldots, |\Theta| - 1 \qquad (17)$$

describes the true rejects for the choice $\theta := \theta(k)$. Note that
the loss due to an increase of the threshold from $\theta(k)$
to $\theta(k+1)$ is always one, by adding exactly one false
reject, i.e. $|\mathcal{L}_{\theta(k)}| = k$.
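The scan described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function name `global_thresholds` and the toy data are hypothetical, certainty values are assumed to be non-negative, and, as in the text, certainty values of correctly classified points are assumed to be distinct.

```python
import math
from itertools import accumulate
from typing import List, Sequence, Tuple


def global_thresholds(r: Sequence[float],
                      correct: Sequence[bool]) -> Tuple[List[float], List[int]]:
    """Return the thresholds theta(k) of (15) and the gain sizes |g(k)| of (16)."""
    # Sort by certainty; for identical values, points in L come first.
    order = sorted(range(len(r)), key=lambda i: (r[i], not correct[i]))
    thetas: List[float] = []
    gains: List[int] = []
    errors_since_last = 0  # errors seen since the previous threshold
    for i in order:
        if correct[i]:
            thetas.append(r[i])            # threshold induced by a point in L
            gains.append(errors_since_last)
            errors_since_last = 0
        else:
            errors_since_last += 1
    if order and not correct[order[-1]]:
        thetas.append(math.inf)            # reject everything
        gains.append(errors_since_last)
    return thetas, gains


# Hypothetical toy data: six points, True marks a correctly classified point.
r = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
correct = [False, True, False, False, True, False]
thetas, gains = global_thresholds(r, correct)
true_rejects = list(accumulate(gains))  # |E_theta(k)| as in (17)
```

The index $k$ of a threshold equals its number of false rejects, so the pairs $(k, \texttt{true\_rejects}[k])$ form the pseudo Pareto front. The cost is dominated by the sort, in line with the $O(N \log N)$ bound stated above.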


4.2 Optimum Local Rejection 

Finding the pseudo Pareto front for local rejection is
more difficult than for a global one because the number
of parameters (thresholds) in the optimisation rises from
one to $\xi$. First, we will derive an optimal solution via
dynamic programming (DP). Secondly, we will
introduce a faster greedy solution which provides a good
approximation of DP.

For every single Voronoi cell $V_j$, the optimum choice
of a threshold and its corresponding pseudo Pareto front
is given in exactly the same way as for the global reject
option: We sort the certainty values of the points
in this Voronoi cell and look for the thresholds induced
by correctly classified points (possibly adding $\infty$) as
depicted in Fig. 7. We use the same notation as for a
global reject option, but indicate via an additional index
$j \in \{1, \ldots, \xi\}$ that these values refer to Voronoi cell $V_j$.
The correctly classified data points in $V_j$ are $L_j := L \cap V_j$,
misclassified points are $E_j := E \cap V_j$. A threshold $\theta_j$ in
$V_j$ leads to false and true rejects $\mathcal{L}_{\theta_j}$ and $\mathcal{E}_{\theta_j}$, respectively.
These rejects accumulate as $\mathcal{L}_\theta = \bigcup_j \mathcal{L}_{\theta_j}$ and
$\mathcal{E}_\theta = \bigcup_j \mathcal{E}_{\theta_j}$ over the entire classifier, characterising the
false and true rejects of the reject strategy with threshold
vector $\boldsymbol{\theta}$. For any separate Voronoi cell, optimum thresholds
as concerns the number of true rejects are induced
by the certainty values of correctly classified points in
this Voronoi cell, possibly adding $\infty$. These thresholds
are referred to as

$$\Theta_j := \{ \theta_j(0), \ldots, \theta_j(|\Theta_j| - 1) \} \qquad (18)$$

equivalent to (15) for Voronoi cell $V_j$ only, where $|\Theta_j| \in
\{|L_j|, |L_j| + 1\}$. These thresholds lead to gains $|g_j(k)|$
equivalent to (16) but restricted to Voronoi cell $V_j$, with
true rejects $|\mathcal{E}_{\theta_j(k)}| = \sum_{i \le k} |g_j(i)|$ and false rejects
$|\mathcal{L}_{\theta_j(k)}| = k$.

Table 1: Example rejects for three Voronoi cells and their losses/gains
(global).

We are interested in threshold vectors which describe
the pseudo Pareto front of the overall strategy, i.e. parameters
$\boldsymbol{\theta}$ such that no $\boldsymbol{\theta}' \ne \boldsymbol{\theta}$ exists which dominates $\boldsymbol{\theta}$
with respect to the true rejects. Obviously, the following
relation holds: $\boldsymbol{\theta}$ is optimal $\Rightarrow$ every $\theta_j$ is optimal in $V_j$.
Otherwise, we could easily improve $\boldsymbol{\theta}$ by improving its
suboptimal component. The converse is not true: As an
example, assume Voronoi cells and thresholds as shown
in Table 1. Here, we can compare the threshold vectors
(1,1,1) and (0,0,3). While both choices lead to 3 false
rejects, the first one encounters 9 true rejects and the second
one leads to 25 true rejects. Hence the second vector
dominates the first one with respect to true rejects, albeit
all threshold components are contained in the pseudo
Pareto front of the corresponding Voronoi cell.

Hence we are interested in efficient strategies that
compute the set of optimum threshold vectors as combinations
of the single values in $\Theta_j$. There exist at most
$|\Theta_1| \cdot \ldots \cdot |\Theta_\xi| = O(|L|^\xi)$ different combinations (using
the trivial upper bound $O(|L_j|) \le O(|L|)$ for each $|\Theta_j|$;
we can expect an order $O(|L|/\xi)$ provided the Voronoi
cells have roughly the same size). While it is possible to
test all possibilities provided a low number of prototypes
$\xi$ is present, this number is infeasible if the number of
prototypes gets large; this is the case in particular in
online schemes or applications for big data. In the following,
we propose two alternative methods to compute
the Pareto front that are linear with respect to $\xi$.
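For a small number of Voronoi cells, the exhaustive test of all combinations mentioned above can be sketched as follows. The function name `brute_force_front` and the per-cell numbers are hypothetical (they are not the values of Table 1); the sketch assumes that each cell $j$ is summarised by a list whose entry $k$ is the number of true rejects $|\mathcal{E}_{\theta_j(k)}|$ obtained with $k$ false rejects.

```python
from itertools import product
from typing import Dict, Sequence, Tuple


def brute_force_front(
    cum: Sequence[Sequence[int]],
) -> Dict[int, Tuple[int, Tuple[int, ...]]]:
    """Enumerate all threshold index vectors and keep the best one per n.

    cum[j][k] is the number of true rejects in cell j for threshold index k,
    which also causes exactly k false rejects in that cell.
    Returns {n: (max true rejects, index vector)}.
    """
    best: Dict[int, Tuple[int, Tuple[int, ...]]] = {}
    for idx in product(*(range(len(c)) for c in cum)):
        n = sum(idx)                                 # total false rejects
        e = sum(c[k] for c, k in zip(cum, idx))      # total true rejects
        if n not in best or e > best[n][0]:
            best[n] = (e, idx)
    return best


# Hypothetical cumulative true rejects for three Voronoi cells.
cum = [[1, 2, 3], [0, 4], [2, 2, 2, 10]]
best = brute_force_front(cum)
```

With these toy numbers, the vector (0, 0, 3) is best for three false rejects although (1, 1, 1) also combines per-cell optimal thresholds. The number of enumerated vectors is the product of the list lengths, which is what makes this approach infeasible for many prototypes.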

4.2.1 Local Threshold Adaptation by DP 

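For any number $0 \le n \le |L|$, $1 \le j \le \xi$, $0 \le i \le |\Theta_j| - 1$
we define:

$$\mathrm{opt}(n,j,i) := \max_{\boldsymbol{\theta}} \{ |\mathcal{E}_\theta| \;\mid\; |\mathcal{L}_\theta| = n,\ \theta_k \in \{\theta_k(0), \ldots, \theta_k(|\Theta_k| - 1)\}\ \forall k < j,\ \theta_j \in \{\theta_j(0), \ldots, \theta_j(i)\},\ \theta_k = \theta_k(0)\ \forall k > j \} \qquad (19)$$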


The term $\mathrm{opt}(n,j,i)$ measures the maximum number of
true rejects that we can obtain with $n$ false rejects and
a threshold vector that is restricted in the sense that the
threshold in Voronoi cell $j$ is one of the thresholds with index at most $i$,
it is any threshold value for Voronoi cell $k < j$, and the
threshold for any Voronoi cell $k > j$ is fixed to the first
threshold value. For technical reasons, it is useful to
extend the index range of the Voronoi cells with 0, which
refers to the initial case that all thresholds are set to their first value $\theta_k(0)$
and which serves as an easy initialisation. Since there are no
thresholds to pick in Voronoi cell $V_0$, we define $|\Theta_0| = 1$,
i.e. the index $i$ is the constant 0 in this virtual cell $V_0$.

For $\mathrm{opt}(n,j,i)$, a few properties hold: First, obviously
the pseudo Pareto front can be recovered from the values
$\mathrm{opt}(n, \xi, |\Theta_\xi| - 1)$ for $n \le |L|$, since these parameters
correspond to the optimum number of true rejects provided
$n$ false rejects and free choice of the thresholds.
Hence an efficient computation scheme for the quantities
$\mathrm{opt}(n,j,i)$ allows us to efficiently compute the Pareto front.

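Second, the decomposition of the optimality terms
along the possible threshold values gives rise to the following
Bellman optimality equation:

$$\mathrm{opt}(n,j,i) = \begin{cases}
\sum_{k=1}^{\xi} |g_k(0)| & \text{if } n = 0 \\
-\infty & \text{if } n > 0,\ j = 0 \\
\mathrm{opt}(n, j-1, |\Theta_{j-1}| - 1) & \text{if } n > 0,\ j > 0,\ i = 0 \\
\mathrm{opt}(n, j, i-1) & \text{if } 0 < n < i,\ j > 0 \\
\max\{ \mathrm{opt}(n, j, i-1),\ \mathrm{opt}(n-i, j-1, |\Theta_{j-1}| - 1) + |\mathcal{E}_{\theta_j(i)}| - |\mathcal{E}_{\theta_j(0)}| \} & \text{if } n \ge i > 0,\ j > 0
\end{cases} \qquad (20)$$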

This recursion captures the decomposition of the problem
along the Voronoi cells as follows:

• In the first case, no false rejects are allowed. Therefore,
the gain is characterised by the sum of the
gains $|g_k(0)|$ over all Voronoi cells; these gains correspond
to the minimum thresholds in all Voronoi
cells which do not reject a correct point.

• In the second case, the number of false rejects has to
equal $n$, but only a trivial threshold with no rejects is
allowed. Hence this choice is impossible, reflected
in the default value $-\infty$.

• In the third case, the threshold of Voronoi cell $j$
and all Voronoi cells with index larger than $j$ are, by
definition of opt in (19), clamped to the first one.
Hence, by definition of the quantity opt in (19), this is
exactly the same as the term $\mathrm{opt}(n, j-1, |\Theta_{j-1}| - 1)$
where no restriction is posed on Voronoi cells 1
to $j-1$, but thresholds are clamped starting from
Voronoi cell $j$.

• In the fourth case, the threshold number $i$ is allowed,
but it would account for $i$ false rejects in the Voronoi
cell $j$ with only $n < i$ false rejects allowed. Hence
we cannot pick number $i$ but a smaller one only.

• The fifth case considers the interesting setting
where optimality is non-trivial: The choice of
threshold number $i$ in Voronoi cell $j$ is possible,
but it is unclear whether it is optimum. There are
only two possible choices: The first is to take a
threshold with smaller index in Voronoi cell $j$, the
second is to choose threshold $i$ in Voronoi cell $j$.
The first choice leads to $\mathrm{opt}(n, j, i-1)$ true rejects.
The second choice has the consequence that $i$ false
rejects occur in Voronoi cell $j$, hence we are only allowed
to reject at most $n - i$ additional false rejects
in Voronoi cells 1 to $j-1$. In turn, however, there
are $|\mathcal{E}_{\theta_j(i)}|$ true rejects in Voronoi cell $j$ as compared
to only $|\mathcal{E}_{\theta_j(0)}|$ if we would pick the smallest
threshold in this Voronoi cell without false rejects.
Hence the optimum number of true rejects which
can be achieved in this case decomposes into the
optimum $\mathrm{opt}(n-i, j-1, |\Theta_{j-1}| - 1)$ which picks
the best thresholds for Voronoi cells 1 to $j-1$ and
keeps all larger ones at the smallest possible value,
and the gain $|\mathcal{E}_{\theta_j(i)}| - |\mathcal{E}_{\theta_j(0)}|$ which we obtain because
of picking threshold number $i$ instead of the first
one in Voronoi cell $j$.

This recursive scheme can be computed by DP, since,
in every recursion, the value $i$ or $j$ is decreased, and the
recursion does not refer to values with larger indices. An
explicit iteration scheme can be structured in three nested
loops over $n \in \{0, \ldots, |L|\}$ followed by $j \in \{1, \ldots, \xi\}$
followed by $i \in \{0, \ldots, |\Theta_j| - 1\}$. Since every evaluation
of equation (20) itself takes constant time, this results in
a computation scheme with effort $O(|L| \cdot \xi \cdot \max_k |\Theta_k|)$.
Memory efficiency is $O(|L| \cdot \max_k |\Theta_k|)$, since the recursion
for threshold $i$ in Voronoi cell $j$ refers to the value
$i - 1$ only, or it directly decreases $j$. Thus a memory
matrix of dimensionality $O(|L| \cdot \xi)$ suffices. This DP
scheme yields the optimum achievable values of true
rejects; one can easily compute optimum threshold vectors
thereof since they correspond to the realisation of
the maxima in the recursive scheme. Hence a standard
back-tracing scheme on the matrix reveals these vectors.
See Algorithm 1 for pseudo code. For memory efficiency
we reduce the tensor $\mathrm{opt}(n,j,i)$ to a matrix
$\mathrm{opt}(n,j)$. The value of $\mathrm{opt}(n,j)$ denotes the maximum
number of true rejects with $n$ false rejects and flexible
thresholds in Voronoi cells $1, \ldots, j$. In this context the
vector $\boldsymbol{\theta}(n,j)$ defines the optimal threshold vector for $n$
false rejects and flexible thresholds in the Voronoi cells
$1, \ldots, j$, whereas the Voronoi cells $j+1, \ldots, \xi$ are set to
the default thresholds (no false rejects).
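The matrix form of the recursion can be sketched as below. This is a knapsack-style reformulation written for illustration and is not a transcription of Algorithm 1: instead of starting from the sum of the gains $|g_k(0)|$ and adding differences as in (20), it accumulates the per-cell true rejects cell by cell, which gives the same values in the final column. The names `dp_front` and `backtrace` and the toy numbers are hypothetical; `cum[j][k]` is assumed to hold $|\mathcal{E}_{\theta_j(k)}|$.

```python
from typing import List, Sequence, Tuple

NEG_INF = float("-inf")


def dp_front(cum: Sequence[Sequence[int]]) -> Tuple[List[float], List[List[int]]]:
    """opt[n]: maximum true rejects with exactly n false rejects over all cells.

    choice[j][n] stores the threshold index picked in cell j for back-tracing.
    """
    n_max = sum(len(c) - 1 for c in cum)
    opt: List[float] = [NEG_INF] * (n_max + 1)
    opt[0] = 0  # virtual cell 0: nothing rejected yet
    choice: List[List[int]] = []
    for c in cum:  # add one Voronoi cell at a time
        new: List[float] = [NEG_INF] * (n_max + 1)
        arg = [-1] * (n_max + 1)
        for n in range(n_max + 1):
            # threshold index i costs i false rejects in this cell
            for i in range(min(n, len(c) - 1) + 1):
                if opt[n - i] == NEG_INF:
                    continue  # n - i false rejects not reachable so far
                cand = opt[n - i] + c[i]
                if cand > new[n]:
                    new[n], arg[n] = cand, i
        opt = new
        choice.append(arg)
    return opt, choice


def backtrace(choice: List[List[int]], n: int) -> Tuple[int, ...]:
    """Recover the threshold index vector for a reachable n."""
    idx = []
    for arg in reversed(choice):
        i = arg[n]
        idx.append(i)
        n -= i
    return tuple(reversed(idx))


# Hypothetical cumulative true rejects for three Voronoi cells.
cum = [[1, 2, 3], [0, 4], [2, 2, 2, 10]]
opt, choice = dp_front(cum)
vector_for_three = backtrace(choice, 3)
```

Only the current column of the table is kept for the values, while the back-pointers require one row per Voronoi cell. On small inputs the result can be checked against an exhaustive enumeration of all index vectors.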

4.2.2 Local Threshold Adaptation by an Efficient 
Greedy Strategy 

Albeit enabling an optimum choice of the local threshold
vectors for given data, DP as proposed above (20) is
infeasible for large training sets since it scales quadratically
with the number of data: The number of thresholds
$\max_j |\Theta_j|$ scales with $N$; we can expect it is of order
$O(N/\xi)$. An even more severe bottleneck is the time
complexity for DP, which is linear in the number of
data points, hence it is not suitable for big data or online
schemes. Therefore, we propose a direct greedy
approximation scheme which is inspired by the full DP
and which yields a method that is, apart from pre-processing, only
linear, with excellent performance at the price of
possible suboptimality of the solution.

The basic idea is to start with the initial setting analogous
to $\mathrm{opt}(0, \xi, |\Theta_\xi| - 1)$: All thresholds are set to
the first choice $\theta_j(0)$, hence no false rejects are present
and the number of true rejects can easily be computed.
Then, a greedy threshold increase is done until the number
of true rejects corresponds to the maximum possible
number $|E|$. While increasing the values, the respective
optima are stored; here, we directly compute the ARC,
but it would easily be possible to compute the number of
true and false rejects and the corresponding thresholds
instead.

The greedy step proceeds as follows: Starting from
$n = 0$, in each round, the number of false rejects $n$ is
increased by one (the default case) or more than one (in
case of ties, which particularly happens if the increase
of a threshold does not affect the number of true rejects
but increases false rejects only). This threshold increase
is always done in the Voronoi cell with maximum immediate
gain. More precisely:

• We consider local gains for each Voronoi cell:
These values are the numbers of true rejects gained
by increasing the threshold index by one in this
Voronoi cell. In addition, we evaluate global gains
that are obtained when accumulating all false rejects
in one Voronoi cell only, and setting the other
thresholds to the first one. All local and global
gains can be computed directly.

• If a global gain surpasses the local gains, this setting
is taken and greedy optimisation continues.

• If a local gain surpasses the global gain, it is
checked whether this choice is unique, or whether
more than one Voronoi cell would allow a threshold
increase with the same quality. In the former
case, this increase is carried out, and the greedy
step continues.

• Otherwise, a tie occurs; this is in particular the case
when the increase of thresholds does not increase
the number of true rejects: This happens, for example,
if the considered threshold corresponds to a
point in a cluster of correctly labelled points; then,
a threshold increase only rejects points from this
cluster, but no true rejects. In this case, we allow
the number of false rejects to increase until the tie is
broken.

This procedure is described in detail in Algorithm 2.
Thereby, we do not explicitly check whether the considered
threshold indices are still in a feasible range; rather,
we implicitly assume that the corresponding gain is set to
$-\infty$ if the threshold would be infeasible. The algorithm
does not necessarily provide the optimum threshold vectors
and hence yields an approximation to the pseudo Pareto front
only, but, as we will see in experiments, it is very close
to it. Unlike the exact algorithm, it works in $O(|L| \cdot \xi)$
time and memory. One example of the algorithmic
loops is depicted in Table 2 for the gains as shown in
Table 1. The table shows the picked threshold indices of
the consecutive iterations of the greedy search.

Table 2: Iterations of the greedy algorithm (Algorithm 2). It is shown how the $n$
false rejects are split among the Voronoi cells $V_j$.

Having proposed efficient exact and approximate algorithms
to determine optimum thresholds, we evaluate
the results of the reject options for different data sets.
In all cases, we use a 10-fold repeated cross-validation
with ten repeats. We evaluate the models obtained by
RSLVQ, GMLVQ, and LGMLVQ with one prototype
per class. Thereby, we can combine the models with
different certainty measures depending on their output:
Since RSLVQ provides probability estimates, we can
combine it with the certainty measure Conf. In turn, GMLVQ
and LGMLVQ lend themselves to the certainty measure
RelSim, which is computed already while training. We
compare our results with a standard rejection measure
of SVM [36, 54] which is implemented in the LIBSVM
toolbox [11].

For numerical reasons, we do not display the setting
$|X_\theta| = 0$. In Fig. 8 to Fig. 10 we display the ARC
averaged over 100 runs per data set and rejection measure.
Note that the single curves have different ranges
for $|X_\theta|/|X|$ corresponding to different thresholds. To
ensure a reliable display, we only report those points
$|X_\theta|/|X|$ for which at least 80 runs deliver a value.

5.1 Data Sets


For evaluation, we consider the following data sets: 

Gaussian Clusters: This data set contains two artificially
generated overlapping 2D Gaussian clusters with
means $\mu_x = (-4, 4.5)$, $\mu_y = (4, 0.5)$, and standard deviations
$\sigma_x = (5.2, 7.1)$ and $\sigma_y = (2.5, 2.1)$. These points
are overlaid with uniform noise.

Pearl Necklace: This data set consists of five artificially
generated Gaussian clusters in two dimensions
with overlap. Mean values are given by $\mu_{y_i} = 3\ \forall i$, $\mu_x =
(2, 44, 85, 100, 136)$, standard deviation per dimension is
given by $\sigma_x = (1, 20, 0.5, 7, 11)$, $\sigma_y = \sigma_x$.

Image Segmentation: The image segmentation data
set consists of 2310 data points which contain 19 real-valued
image descriptors. The data represent small
patches from outdoor images with 7 different classes
with equal distribution such as grass, cement, etc. [13].

Tecator: The Tecator data set [49] consists of 215
spectra of meat probes. The 100 spectral bands range
from 850 nm to 1050 nm. The task is to predict the fat
content (high/low) of the probes, which is turned into a
balanced two class classification problem.

Haberman: The Haberman survival data set includes
306 instances of two classes indicating being alive for
more than 5 years after breast cancer surgery. One
instance represents three features linked to the age, the
year, and the number of positive axillary nodes detected.

Coil: The Columbia Object Image Database Library
contains gray scaled images of twenty objects [35]. Each
object is rotated in 5° steps, resulting in 72 images per
object. The data set contains 1440 vectors with 16384
dimensions that are reduced with PCA to 30.

Since ground truth is available for the first two, artificial
data sets, we can use the optimum Bayesian decision as
a gold standard for comparison in these two cases.

5.2 Comparison of DP vs. Greedy Optimization

First, we evaluate the performance of a greedy optimisation
for the computation of local reject thresholds versus
an optimum DP scheme. The results are compared in
Fig. 8. Since we are interested in the ability of the
heuristics to approximate optimum thresholds, ARCs
are computed on the training set for which the threshold
values are exactly optimized using DP.

One can clearly observe that the resulting curves are 
very similar for the shown data sets and the models 
provided by GMLVQ as well as LGMLVQ. Only for the 
Tecator (Haberman) data set the optimum DP solution 
beats the greedy strategy in a small region, in particular 
for settings with a large portion of rejected data points 
(that are usually of less interest in practice since almost 
all points are rejected in these settings). Results on the 
other data sets show a similar behaviour. 

Hence we can conclude that the greedy optimisation 
provides near optimal results for realistic settings, while 
requiring less time and memory complexity. Because 
of this fact we will use the greedy optimisation for the 
local reject options in the following analyses. 

5.3 Experiments on Artificial Data 

Thereby, we report the ARC obtained on a hold out test 
set (which is also not used for threshold optimisation) 
in order to judge the interesting generalisation error of 
the classification models with reject option. The data 
densities for the artificial data sets Gaussian clusters 
and Pearl necklace are known. Hence we can compare 
local and global reject options on these data with the 
optimum Bayes rejection, see Fig. [9] Thereby, RSLVQ is 
combined with Conf as rejection measure, while RelSim 
is used for deterministic LVQ models, relying on the 
insights as gained in the studies (39] GElIi EIIQID. For 


all settings, the performance of the classifier on the test 
set is depicted, after optimising model parameters and 
threshold values on the training set. Results of a repeated 
cross-validation are shown, as specified before. 

Gaussian Clusters: For Gaussian clusters, the global
and the local rejection ARCs are almost identical for all
three models. Therefore, in this setting, it is not necessary
to carry out a local strategy, but a computationally
more efficient global reject option suffices. Interestingly,
reject strategies reach the quality of an optimum
Bayesian reject in the relevant regime of up to 25 % rejected
data points as can be seen in the left part of the
ARCs. RSLVQ, due to its foundation on a probabilistic
model, even enables a close to optimum rejection for the
full regime, see Fig. 9.

Pearl Necklace: The pearl necklace data set is designed
to show the advantage of local rejection as already
mentioned before when referring to Fig. 5. Here it
turns out that local rejection performs better than global
rejection for the models RSLVQ and GMLVQ. As can
be seen from Fig. 9, neither RSLVQ nor GMLVQ reaches
the optimum decision quality, but the ARCs are
greatly improved when using a local instead of a global
threshold strategy. This observation can be attributed
to the fact that the scaling behaviour of the certainty
measure is not the same for the full data space in these
settings: RSLVQ is restricted to one global bandwidth;
similarly, GMLVQ is restricted to one global quadratic
form. This enforces a scaling of the certainty measure
which does not scale uniformly with the (varying) certainty
as present in the data. In comparison, LGMLVQ is
capable of reaching the optimum Bayesian reject boundary
for both local and global reject strategies, caused by
the local scaling of the quadratic form in the model. The
analysis on these artificial data sets is a first indicator
that local reject options can be superior to
global ones in particular for simple models. On the other
hand, there might be only a small difference between
local and global reject options for good models. In all
cases, a sufficiently flexible LVQ model together with
the proposed reject strategies reaches the quality of an
optimum Bayesian reject strategy.

5.4 Experiments on Benchmarks 

For the benchmark data sets, the underlying density models
are unknown, hence we cannot report the result of
an optimum Bayes rejection. For these settings, as an
alternative, we report the results which are obtained with
an SVM and the reject option as introduced in [36, 54].
Figure 10 displays all results.

Fig. 8: Averaged accuracy reject curves for dynamic programming (DP) and the greedy optimization applied on artificial and benchmark data
sets for the relative similarity (RelSim).

Tecator: RSLVQ and LGMLVQ provide results
which are comparable to the SVM, while GMLVQ leads
to worse accuracy. Note, however, the scaling: Also
in the latter case, the classification accuracy of the full
model is about 92 %, which increases to 94 % when rejecting
10 % of the data. In this regime, for GMLVQ,
the local threshold strategy is slightly better than a global
one.

Image Segmentation: For this setting, the SVM
yields the best classification accuracy of 97 % compared
to 95 % for LGMLVQ (and less for the other models).
This fact can be explained by the simpler model provided
by LVQ techniques as compared to SVM, which
can rely on a more complex classification boundary in
this setting. Still, the reject strategies for the LVQ models
are highly performant: Rejecting 10 % of the data
enables an increase of the classification accuracy by 3 %
for LGMLVQ. For the simpler models RSLVQ and GMLVQ,
again, a benefit of local versus global thresholds
can clearly be observed.

Haberman: For the Haberman data, all LVQ models
display the same ARC as SVM models for the interesting
regime of at most 25 % rejections in the data. For larger
reject fractions, deterministic LVQ methods are superior
to SVM models and corresponding reject options.

Coil: The coil data set allows a high classification
accuracy reaching 100 %. LVQ models display a slightly
smaller accuracy for the full data set due to their simple
form, representing the model by few prototypes only.
Here, the benefit of reject options is obvious, since it enables
reaching 100 % accuracy when rejecting less than
10 % of the data for GMLVQ (less than 2 % for LGMLVQ).
The probabilistic counterpart RSLVQ performs
worse, but again, the superiority of local rejects versus
global options is clearly apparent for this weaker model.

Based on these experiments, we conclude the following:

• Reject options can greatly enhance the classification
performance, provided the classification accuracy
is not yet optimum.

• Local reject options yield better results than global
ones, whereby this effect is stronger for simple
models for which the classification accuracy on the
full data set is not yet optimum. For more flexible
models with excellent classification accuracy for
the full data set, this effect is not necessarily given.

• LGMLVQ and the proposed reject option are comparable
to SVM and the standard reject option for the
considered data.

We would like to emphasise that the models as provided
by LVQ techniques are sparse as compared to the SVM
since we use only one prototype per class. Further, the
proposed global reject option depends on the prototypes
only, while the SVM technique requires a tuning of the
non-linearity on the given data [36].

5.5 Medical Application 

We conclude with a recent example from the medical
domain. The adrenal tumours data [3] contains 147
data points composed of 32 steroid marker values. Two
classes are present: Patients with benign adrenocortical
adenoma (ACA) or malignant carcinoma (ACC). The 32
steroid marker values are measured from urine samples
using gas chromatography/mass spectrometry. For further
medical details we refer to [3, 2]. The two classes
are unbalanced with 102 ACA and 45 ACC data points.

Fig. 9: Averaged ARCs for global and local rejection evaluated on the test sets. For RSLVQ, Conf (7) serves as rejection measure and for the
other two models RelSim serves as rejection measure. The Bayes rejection with known class probabilities provides a gold standard for
comparison.

Our analysis of the data follows the proposed eval¬ 
uations in S3 El: We train a GMLVQ model with one 
prototype per class. We use the same pre-processing as 
described in (3 El- The data set has 56 missing values 
(out of 4704). GMLVQ can deal with these values by 
ignoring them for the distance computation and update, 
whenever the values are missing. This corresponds to 
a substitution of the values by the average as provided 
by the closest prototype. The same treatment of missing 
values is possible when calculating the RelSim values 
for rejection. For the evaluation of reject options we 
split the data into a train set (90%) and a test set (10%). 
We evaluated the ARC of 1000 random splits of the data
and the corresponding GMLVQ models. The averaged
ARCs of the tested reject options can be found in Fig. 11.
There is nearly no difference between the curves of the
global and the local rejection for small rejection rates (up
to 10 %). For more than 10 % rejection, the local rejection
strategy improves the accuracy more than the global
one. Its ARC is comparable to the ARC associated with
SVM rejection computed based on LIBSVM ifTll . This
can be attributed to the fact that the scaling of RelSim is
not uniform as compared to the inherent scaling of the
data in this regime. For SVM, missing value imputation
has to be done; here we replace the missing values by
the class conditional means, following the suggestion in
0. On average, the SVM models lead to 31 support
vectors, whereas the GMLVQ models contain only 2
prototypes. Further, the GMLVQ model provides insight
into potentially relevant biomarkers and prototypical
representatives of the classes, as has been detailed in the
publication [2]. The suggested biomarkers, in particular,
have been linked with biomedical insight 0. In conclusion,
the GMLVQ model together with the proposed
reject scheme offers a reliable and compact model for
this medical application.

Fig. 10: Averaged ARCs for global and local rejection evaluated on the test sets. For RSLVQ, Conf (7) serves as rejection measure and for the
other two models RelSim (8) serves as rejection measure. The SVM rejection is used as a state of the art method for comparison.

Fig. 11: Averaged ARCs for global and local rejection (test set). We
use the RelSim as rejection measure.
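The treatment of missing values described above can be sketched as follows. This is a minimal illustration under two assumptions that are not spelled out in this section: the GMLVQ distance is the quadratic form (x − w)ᵀ Λ (x − w) with a positive semidefinite matrix Λ, and RelSim is the relative similarity (d⁻ − d⁺)/(d⁻ + d⁺), where d⁺ is the distance to the closest prototype and d⁻ the distance to the closest prototype of a different class. All function and variable names are hypothetical.

```python
import numpy as np

def gmlvq_distance(x, w, Lam):
    """Quadratic-form distance (x - w)^T Lam (x - w); missing entries (NaN) of x are ignored."""
    diff = np.asarray(x, dtype=float) - np.asarray(w, dtype=float)
    # Zeroing the difference equals substituting the prototype's value for the missing entry.
    diff = np.where(np.isnan(diff), 0.0, diff)
    return float(diff @ Lam @ diff)

def relsim(x, prototypes, labels, Lam):
    """Return (winner index, RelSim value) for one data point x."""
    d = np.array([gmlvq_distance(x, w, Lam) for w in prototypes])
    labels = np.asarray(labels)
    win = int(np.argmin(d))
    d_plus = d[win]                            # closest prototype
    d_minus = d[labels != labels[win]].min()   # closest prototype of another class
    return win, (d_minus - d_plus) / (d_minus + d_plus)

# Toy setting with one prototype per class and one missing feature (made-up values).
prototypes = np.array([[0.0, 0.0], [2.0, 0.0]])
labels = [0, 1]
Lam = np.eye(2)
x = np.array([0.5, np.nan])
winner, r = relsim(x, prototypes, labels, Lam)
print(winner, r)
```

Because the missing dimension contributes nothing to either distance, the same routine serves training-time distance computation and the rejection measure.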


6 Conclusion 

In this article, we introduced reject strategies for
prototype-based classifiers and extensively evaluated the
proposed methods for diverse data sets, comparing them
to state of the art reject options available for SVM.
In particular, we introduced global and local reject strategies
and addressed the problem of their efficient computation.
We introduced two algorithms to derive optimum
local reject thresholds: (i) an optimum technique based
on dynamic programming (DP) and (ii) a fast greedy
approximation. While the first is provably optimum, the
latter is based on heuristics. However, we showed that
the results of both solutions are very similar, so that
the fast greedy solution seems a reasonable choice over
the more complex solution via DP. Its time
complexity is only linear with respect to the number of
data, while DP requires quadratic time, and its memory
complexity is constant with respect to the number of data,
while DP's memory size depends linearly on the number
of data points.

When investigating these techniques for diverse real-life
data sets, the benefit of local strategies becomes apparent
in particular for simple prototype-based models.
The effect is less pronounced for more complex models
that involve local metric learning like LGMLVQ. Interestingly,
the proposed reject strategies in combination
with the very intuitive deterministic method LGMLVQ
lead to results which are comparable to SVM and
corresponding reject options. The LVQ techniques
base the reject decision on the distance to a few prototypes only,
hence they open the way towards efficient techniques for
online scenarios.

So far, the reject strategies have been designed and
evaluated for offline training scenarios only, disregarding
the possibility of trends present in life long learning
scenarios, or the coupling to possibly varying costs for
rejects versus errors. We will analyse in future work
how to extend the proposed methods to online scenarios
and life long learning, where suitable thresholds are
picked automatically based on the results of
this article.


Algorithm 1: DP(θ_j(i), |E_{θ_j(i)}|)

// compute optimum number of true rejects by DP
// init
h := Σ_{j=1,...,ξ} |E_{θ_j(0)}|;
for k := 0, ..., ξ do
    opt(0, k) := h;
for n := 1, ..., |L| do
    for k := 0, ..., ξ do
        opt(n, k) := −∞;

// loop over number of false rejects
for n := 1, ..., |L| do
    // loop over Voronoi cells
    for j := 1, ..., ξ do
        opt(n, j) := opt(n, j − 1);
        // loop over thresholds in Voronoi cell j
        // that agree with false rejects
        for i := 1, ..., min{n, |Θ_j| − 1} do
            n' := n − i;
            gain := |E_{θ_j(i)}| − |E_{θ_j(0)}|;
            h := opt(n', j − 1) + gain;
            if h > opt(n, j)
                then opt(n, j) := h;

// compute threshold vector by back-tracing
// init with default value: first thresholds
for n := 0, ..., |L| do
    for k := 1, ..., ξ do
        θ(n, k) := θ_k(0);

// back-tracing in the matrix opt
for n := 1, ..., |L| do
    // start in last Voronoi cell
    j := ξ;
    n' := n;
    i := min(n', |Θ_j| − 1);
    while j > 0 do
        if i = 0
            then
                // threshold 0
                j := j − 1;
                i := min(n', |Θ_j| − 1);
            else
                n'' := n' − i;
                gain := |E_{θ_j(i)}| − |E_{θ_j(0)}|;
                h := opt(n'', j − 1) + gain;
                if opt(n', j) = h
                    then
                        // threshold i
                        θ(n, j) := θ_j(i);
                        n' := n'';
                        j := j − 1;
                        i := min(n', |Θ_j| − 1);
                    else
                        // threshold smaller
                        i := i − 1;

// return optimum true reject numbers
// and corresponding threshold vectors
return (matrices opt(n, k) and θ(n, k))
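The dynamic programming recursion of Algorithm 1 can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes that each Voronoi cell j is summarised by a list `true_rejects[j]`, whose entry i is the number of true rejects obtained when the threshold of cell j admits i false rejects. The name `dp_local_thresholds` and the toy table are hypothetical.

```python
def dp_local_thresholds(true_rejects):
    """Maximise the number of true rejects for every number n of false rejects.

    true_rejects[j][i]: true rejects in cell j if its threshold admits i false rejects.
    Returns (opt, thresholds): opt[n] is the optimum for exactly n false rejects and
    thresholds[n] lists the chosen threshold index of every cell.
    """
    cells = len(true_rejects)
    n_max = sum(len(t) - 1 for t in true_rejects)
    base = sum(t[0] for t in true_rejects)  # all cells at their first threshold
    neg_inf = float("-inf")
    # opt[j][n]: best value if only the first j cells may move away from index 0
    opt = [[neg_inf] * (n_max + 1) for _ in range(cells + 1)]
    arg = [[0] * (n_max + 1) for _ in range(cells + 1)]
    for j in range(cells + 1):
        opt[j][0] = base
    for j in range(1, cells + 1):
        t = true_rejects[j - 1]
        for n in range(1, n_max + 1):
            best, best_i = opt[j - 1][n], 0
            for i in range(1, min(n, len(t) - 1) + 1):
                h = opt[j - 1][n - i] + t[i] - t[0]  # gain relative to the first threshold
                if h > best:
                    best, best_i = h, i
            opt[j][n], arg[j][n] = best, best_i
    # back-tracing: recover one optimum threshold index per cell for every n
    thresholds = []
    for n in range(n_max + 1):
        idx, rest = [0] * cells, n
        for j in range(cells, 0, -1):
            idx[j - 1] = arg[j][rest]
            rest -= arg[j][rest]
        thresholds.append(idx)
    return opt[cells], thresholds

# Toy table with two cells (made-up counts).
opt, thresholds = dp_local_thresholds([[0, 2, 3], [1, 1, 4]])
print(opt)
print(thresholds[3])
```

Storing the arg-max index next to each table entry replaces the comparison-based back-tracing of the pseudocode; both need memory proportional to the size of the table.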


Algorithm 2: Greedy optimization(θ_j(i), |E_{θ_j(i)}|)

// init by first thresholds
for j := 1, ..., ξ do
    I(j) := 0;
h := Σ_{j=1,...,ξ} |E_{θ_j(0)}|;
|E_θ| := h;
n := 0; s := 1;
t_c(s) := 1 − |E_θ| / |X|;
t_a(s) := |L| / (|X| − |E_θ|);

// loop while true rejects can be increased
while |E_θ| < |E| do
    // most improvement locally
    gain := max_j {|E_{θ_j(I(j)+1)}| − |E_{θ_j(I(j))}|};
    I_gain := argmax_j {|E_{θ_j(I(j)+1)}| − |E_{θ_j(I(j))}|};
    // most improvement globally
    GAIN := max_j {|E_{θ_j(n+1)}| − |E_{θ_j(0)}|};
    I_GAIN := argmax_j {|E_{θ_j(n+1)}| − |E_{θ_j(0)}|};
    if GAIN > (gain + |E_θ| − h)
        then
            for j := 1, ..., ξ do
                I(j) := 0;
            n := n + 1;
            I(I_GAIN) := n;
            |E_θ| := GAIN + h;
        else
            if I_gain is unique
                then
                    I(I_gain) := I(I_gain) + 1;
                    |E_θ| := |E_θ| + gain;
                    n := n + 1;
                else
                    // increase false rejects
                    o := 1;
                    repeat
                        o := o + 1;
                        gain := max_j {|E_{θ_j(I(j)+o)}| − |E_{θ_j(I(j))}|};
                        I_gain := argmax_j {|E_{θ_j(I(j)+o)}| − |E_{θ_j(I(j))}|};
                    until I_gain is unique;
                    n := n + o;
                    I(I_gain) := I(I_gain) + o;
                    |E_θ| := |E_θ| + gain;
    s := s + 1;
    t_c(s) := 1 − (n + |E_θ|) / |X|;
    t_a(s) := (|L| − n) / (|X| − (n + |E_θ|));

return (t_c, t_a)
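The local step of the greedy scheme can be sketched in Python as follows. This is a deliberately simplified variant and not a transcription of Algorithm 2: it only performs the "most improvement locally" move, breaks ties by taking the first maximising cell, and omits the global restart and the look-ahead for non-unique maxima. The input encoding (`true_rejects[j][i]` as the number of true rejects in cell j when its threshold admits i false rejects) and all names are assumptions made for illustration.

```python
def greedy_local_sweep(true_rejects, n_correct, n_total):
    """Simplified greedy sweep over local thresholds.

    n_correct = |L| (correctly classified points), n_total = |X| (all points).
    Returns a list of (t_c, t_a): fraction of accepted points and accuracy on them.
    """
    n_errors = n_total - n_correct
    idx = [0] * len(true_rejects)             # current threshold index per cell
    n_false = 0                               # false rejects so far
    n_true = sum(t[0] for t in true_rejects)  # true rejects at the first thresholds
    curve = [(1 - n_true / n_total, n_correct / (n_total - n_true))]
    while n_true < n_errors:
        # cells whose threshold can still be raised by one step
        candidates = [j for j, t in enumerate(true_rejects) if idx[j] + 1 < len(t)]
        if not candidates:
            break

        def local_gain(j):
            t = true_rejects[j]
            return t[idx[j] + 1] - t[idx[j]]

        best = max(candidates, key=local_gain)  # first maximiser wins on ties
        n_true += local_gain(best)
        idx[best] += 1
        n_false += 1
        accepted = n_total - n_false - n_true
        if accepted == 0:
            break
        curve.append((accepted / n_total, (n_correct - n_false) / accepted))
    return curve

# Toy table with two cells, 6 correct and 7 misclassified points (made-up counts).
curve = greedy_local_sweep([[0, 2, 3], [1, 1, 4]], n_correct=6, n_total=13)
for t_c, t_a in curve:
    print(round(t_c, 3), round(t_a, 3))
```

Only the current index vector and a few counters are kept between steps, which reflects the small memory footprint of the greedy approach; the curve itself is stored here purely for plotting.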





